Make sure you have the following R packages installed
library(igraph)
library(vegan)
library(raster)
library(rworldmap)
library(betapart)
library(RColorBrewer)
library(stringr)
library(maps)Remember that you can install packages in R with install.packages("package name")
Scientific aspects to define biogeographic regions
The characterization of geographical areas in terms of biodiversity.
Fundamental abstractions of life organization in the planet in response to past or current ecological forces.
Multidimensional biodiversity variation (community turnover)
The region of interest corresponds to the sum of the biodiversity variance which we are interested to classify into geographically definable regions. The dimensions of the region of interest can be as many as the dimensions of biodiversity change. However, depending of the research question or because information limitation ecological research often limits the region of interest. Boundaries to the region of interest can be applied in geographical space, climatic space, across a temporal axis, and to particular taxonomic groups.
These are some sources of information on planetary scale geographical limits.
The distribution of life over Earth has been a long-studied subject and systematized by Carl Linnaeus and natural historians like Humboldt, Wallace, and Darwin. Decades long of scientific research and natural history increasingly being condensed into large-scale datasets which are freely available in the internet. Digitally available data on species distributions can originally come from various sources such as biodiversity inventories, field research reports, field observations, citizen science, etc. Knowledge on the distribution of life across the plane is more accessible today because of database integration and digital information infrastructures such as the Global Biodiversity Information Facility (GBIF).
There are two main ways how the geographic distribution of species is digitally stored in databases.
This is a simple format to store information on species distribution. At the minimal level a geocoded point involves tree variables which are the 2 spatial coordinates and the species’ (or other taxonomic category) name. This format is probably the most common way which information in the geography of biodiversity is stored in the globe. This simple format make it easy to store and takes little space in memory. However, relying only on geocoded points has is limitations. For instance the spatial patterns can be strongly because of sampling effort. For poorly sampled species, a single new observation can dramatically increase the extent of the species geographic distribution. Similarly, the sparsity of observations may influence community similarity measures. This problem is particularly accentuated in biodiversity rich and/or poorly sampled regions of the globe such as the tropics, mountains, or near the poles.
The delineation of species geographical ranges is second commonly used format to store information on species distribution. Instead of points, species ranges represent a portion of any representation of the geographical surface of the planet. The delineation is often done based the information on species point data + climate models + expert knowledge. Digital information of species ranges usually come stored as vector polygons or rasters. For instance, a common format is the convex hull, which is the smallest set in space that includes all the species points. More sophisticated methods, such as species distribution modelling uses climatic variation to infer species environmental niches by modelling probabilities of species occurrences in space. Conservation agencies such as the IUCN use this species distribution models along taxonomic and local experts to delimit a more continuous surface to represent the spatial range of a species. Species ranges hold more information than geocoded points at the species level, however biodiversity information at this level of detail is only available for a limited number of taxa.
https://www.iucnredlist.org/species/172220/6852079
https://www.iucnredlist.org/species/173248/6979817
| x | y | sp |
|---|---|---|
| 38 | -27 | Sp 4 |
| 17 | -16 | Sp 4 |
| 26 | -1 | Sp 3 |
| 18 | -15 | Sp 2 |
| 29 | -38 | Sp 3 |
| 11 | -31 | Sp 3 |
| 1 | -34 | Sp 1 |
| 16 | -25 | Sp 3 |
| 36 | -2 | Sp 3 |
| 33 | -18 | Sp 5 |
For this tutorial our region of interest will be the regions of North America that corresponds to the United States and Canada. We will focus on the biodiversity variation of bryophytes. Bryophytes are non-vascular plants that occur in many habitats across the globe. Bryophytes include liverworts, mosses, and hornworts. Bryophytes have been sampled extensively in various regions of the Atlantic and Pacific zones of Canada and the United States. Recently, the Beaty Museum of the University of British Columbia made available the information on their entire collection of bryophyte occurrences. The dataset is freely available through GBIF. I previously downloaded the dataset from the GBIF, if you want to download it yourself it is here: https://www.gbif.org/dataset/4edd9396-59df-4b01-9e29-dc21a59f9963
This dataset comes from the bryophyte collections of the University of British Columbia.
https://beatymuseum.ubc.ca/research-2/collections/herbarium/herbarium-bryophytes/
Let’s load the dataset, which is currently in the data folder stored as a .csv file.
briofCan <- read.csv("data/BriophitesCAN.csv",
header = T,
sep = "\t") # note we are specifing a tab-delimited file.
head(briofCan)## gbifID datasetKey
## 1 1987794314 4edd9396-59df-4b01-9e29-dc21a59f9963
## 2 1987792636 4edd9396-59df-4b01-9e29-dc21a59f9963
## 3 1987792129 4edd9396-59df-4b01-9e29-dc21a59f9963
## 4 1987793965 4edd9396-59df-4b01-9e29-dc21a59f9963
## 5 1987793938 4edd9396-59df-4b01-9e29-dc21a59f9963
## 6 1987792195 4edd9396-59df-4b01-9e29-dc21a59f9963
## occurrenceID kingdom phylum
## 1 A77D8408-E621-B14B-ACC8-022FA5867C3B Plantae Bryophyta
## 2 A7856BA9-045B-D940-9EF8-86F982F61919 Plantae Bryophyta
## 3 A78F39BC-D010-F747-B245-24B9E067AFD3 Plantae Bryophyta
## 4 A796CD2C-24D6-2C46-89DD-E4A5E9B51738 Plantae Bryophyta
## 5 A798A3E3-65C6-1948-A621-B5ACA58AC3D7 Plantae Marchantiophyta
## 6 A79A61B8-6810-4243-9CA6-A7EC4872D999 Plantae Bryophyta
## class order family genus
## 1 Bryopsida Grimmiales Grimmiaceae Dryptodon
## 2 Andreaeopsida Andreaeales Andreaeaceae Andreaea
## 3 Bryopsida Grimmiales Grimmiaceae Dryptodon
## 4 Bryopsida Dicranales Dicranaceae Pilopogon
## 5 Jungermanniopsida Jungermanniales Plagiochilaceae Plagiochila
## 6 Bryopsida Pottiales Pottiaceae Hymenostylium
## species infraspecificEpithet taxonRank
## 1 Dryptodon patens SPECIES
## 2 Andreaea rupestris rupestris SUBSPECIES
## 3 Dryptodon patens SPECIES
## 4 Pilopogon guadalupensis SPECIES
## 5 Plagiochila schofieldiana SPECIES
## 6 Hymenostylium recurvirostrum SPECIES
## scientificName
## 1 Dryptodon patens Bridel, 1826
## 2 Andreaea rupestris subsp. rupestris
## 3 Dryptodon patens Bridel, 1826
## 4 Pilopogon gracilis (Hook.) Brid.
## 5 Plagiochila schofieldiana Inoue
## 6 Bryoerythrophyllum recurvirostrum (Hedw.) P.C.Chen
## verbatimScientificName
## 1 Dryptodon patens (Dicks. ex Hedw.) Brid.
## 2 Andreaea rupestris Hedw. subsp. rupestris
## 3 Dryptodon patens (Dicks. ex Hedw.) Brid.
## 4 Pilopogon gracilis (Hook.) Brid.
## 5 Plagiochila schofieldiana H. Inoue
## 6 Bryoerythrophyllum recurvirostre (Hedw.) Chen
## verbatimScientificNameAuthorship countryCode
## 1 (Dicks. ex Hedw.) Brid. CA
## 2 CA
## 3 (Dicks. ex Hedw.) Brid. US
## 4 (Hook.) Brid. PA
## 5 H. Inoue US
## 6 (Hedw.) Chen CA
## locality
## 1 Paulson Bridge on Rte 3 between Castlegar and Christina Lake
## 2 Princess Royal Island, Chapele Inlet; stream canyon near mouth, E side
## 3 Adak Quadrangle, Adak Is., Aleutian Islands, Mt. Reed; Co:
## 4 cerro colorado, 4.3 i. above chami camp. roadside along the ridge, ca 3 mi.
## 5 NW of Alexai Point, S slope of Gilbert Ridge, Attu Island, Aleutian Is.
## 6 Canoe Lake, S slope of ridge E of Canoe Lake; Co:
## stateProvince occurrenceStatus individualCount
## 1 British Columbia PRESENT NA
## 2 British Columbia PRESENT NA
## 3 Alaska PRESENT NA
## 4 Bocas Del Toro PRESENT NA
## 5 Alaska PRESENT NA
## 6 Northwest Territories PRESENT NA
## publishingOrgKey decimalLatitude decimalLongitude
## 1 b542788f-0dc2-4a2b-b652-fceced449591 49.2 -118.1
## 2 b542788f-0dc2-4a2b-b652-fceced449591 52.9 -129.1
## 3 b542788f-0dc2-4a2b-b652-fceced449591 51.8 -176.7
## 4 b542788f-0dc2-4a2b-b652-fceced449591 8.6 -81.8
## 5 b542788f-0dc2-4a2b-b652-fceced449591 52.9 173.2
## 6 b542788f-0dc2-4a2b-b652-fceced449591 68.2 -135.9
## coordinateUncertaintyInMeters coordinatePrecision elevation elevationAccuracy
## 1 7303 NA NA
## 2 6854 NA NA
## 3 7006 NA NA
## 4 11024 NA NA
## 5 6854 NA NA
## 6 4170 NA NA
## depth depthAccuracy eventDate day month year taxonKey speciesKey
## 1 NA NA 1977-08-05T00:00:00 5 8 1977 5281814 5281814
## 2 NA NA 1986-08-09T00:00:00 9 8 1986 7347162 5283962
## 3 NA NA 1986-08-06T00:00:00 6 8 1986 5281814 5281814
## 4 NA NA 1986-06-21T00:00:00 21 6 1986 8232977 2675529
## 5 NA NA 2000-08-12T00:00:00 12 8 2000 8003542 8003542
## 6 NA NA 1965-06-20T00:00:00 20 6 1965 2670498 7347186
## basisOfRecord institutionCode collectionCode catalogNumber recordNumber
## 1 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B18100 67210
## 2 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B107037 86589
## 3 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B207029 1986-187
## 4 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B110795 5341
## 5 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B185157 115694
## 6 PRESERVED_SPECIMEN ca.ubc UBCBryophytes B81228 plot no. 31-6
## identifiedBy dateIdentified license rightsHolder
## 1 CC0_1_0 University of British Columbia
## 2 B.M. Murray,1988 CC0_1_0 University of British Columbia
## 3 W.B. Schofield CC0_1_0 University of British Columbia
## 4 CC0_1_0 University of British Columbia
## 5 W.B. Schofield CC0_1_0 University of British Columbia
## 6 O. Lee, 1985 CC0_1_0 University of British Columbia
## recordedBy typeStatus establishmentMeans
## 1 W.B. Schofield, G.F. Otto
## 2 W.B. Schofield
## 3 Stephen S. Talbot
## 4 B.H. Allen
## 5 W.B. Schofield, S.S. Talbot
## 6 J. Lambert, D. Morrison
## lastInterpreted mediaType
## 1 2020-12-15T22:17:01.559Z
## 2 2020-12-15T22:17:01.559Z
## 3 2020-12-15T22:17:01.559Z
## 4 2020-12-15T22:17:01.560Z
## 5 2020-12-15T22:17:01.560Z
## 6 2020-12-15T22:17:01.560Z
## issue
## 1 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 2 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 3 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 4 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 5 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID
## 6 GEODETIC_DATUM_ASSUMED_WGS84;GEODETIC_DATUM_INVALID;TAXON_MATCH_FUZZY
## [1] 4517
## [1] 6012
Let’s observe the geographic information contained in the dataset
plot(briofCan$decimalLatitude~ briofCan$decimalLongitude,
pch = ".",
frame = F,
main = "Bryophytes recorded in dataset",
xlab = "Long",
ylab = "Lat")The raw dataset is extensive, and there are some data which is obviously misplaced.
We must have a clean version of the dataset before we start doing any analyses on it.
Let’s start by removing all fields that have no associated geographic information.
coords <- data.frame("x"=briofCan$decimalLongitude,
"y"=briofCan$decimalLatitude)
# only georeferenced recors
briofCan <- briofCan[complete.cases(coords),]Second, let’s set our ROI within the continental political boundaries of Canada and the United States of America to do some more housekeeping.
# subset coords
US_CAN <- countriesLow[countriesLow$SOVEREIGNT %in% c("Canada", "United States of America"),]
coords <- coords[complete.cases(coords),]
code <- over(SpatialPoints(coords,
proj4string = crs(US_CAN)),
US_CAN)$POSTAL
coords$code <- code
briofCanClean <- briofCan[coords$code %in% c("CA", "US"),]
briofCanClean <- droplevels(briofCanClean[briofCanClean$phylum == "Bryophyta",])
# clean hawai (to keep records only at the continental shelf)
briofCanClean <- briofCanClean[briofCanClean$decimalLatitude > 30 & briofCanClean$decimalLongitude <0,]
# How many observations are we left with?
length(unique(briofCanClean$speciesKey))## [1] 1205
## [1] 1681
# let's plot the geographic distribution of the clean data
plot(decimalLatitude~ decimalLongitude,
xlab = "Longitude",
ylab = "Latitude",
pch = 15,
cex = 0.1,
col = "red",
data = briofCanClean,
frame = F)
maps::map(regions = c("USA", "Canada"),fill = F,
col = "black",
add = T)Resolution and scale is important to consider before we start to examine biodiversity variation in space. Importantly, because of data limitation at larger scales often is necessary to aggregate point pattern data. The chosen grid resolution both at the area and limits can influence the overall patterns we may be able to identify.
Discussion point:
How different choices of grid resolution affect our observations of spatial patterns in biodiversity?
What is the relationship between resolution and scale?
imgAgg
For this tutorial we will define biogeographic regions using a spatial resolution of 1x1 degree. This means that we will consider all species in a grid as separate biological “communities”. It is important to discuss the ecological implications of this assumption. For instance, we can question whether the variation in grids represents accurate representations of the ecological forces defining the coexistence of species into communities. Similarly, we can question whether the biodiversity variance is homogeneous across all grids. For example, within grid variance is expected to be high in biodiversity rich areas such as the tropical rain forest in contrast to biodiversity poor regions such as desserts. For continental scale studies 1x1 degree is often a standard.
## [1] 30.18 82.50
# lets round to 1 degree...
briofCanClean$latRound <- round(briofCanClean$decimalLatitude)
briofCanClean$lonRound <- round(briofCanClean$decimalLongitude)
# get a unique id to identify each grid
briofCanClean$unID <- paste(briofCanClean$latRound,
briofCanClean$lonRound,
sep = "_")
plot(briofCanClean$lonRound,
briofCanClean$latRound,
xlab = "Longitude",
ylab = "Latitude",
pch = 15,
cex = 0.4,
col = "firebrick",
frame = F)
maps::map(regions = c("USA", "Canada"),
fill = F,
col = "black",
add = T)gridTab <- table(briofCanClean$unID,
briofCanClean$scientificName )
gridTab2 <- gridTab # sampling effort
gridTab[gridTab > 1 ] <- 1 # p/a
hist(log(rowSums(gridTab2)+1))plotAmap <- function(gridTab, main ){
dat <- data.frame(
"div"= rowSums(gridTab),
data.frame(
apply(stringr::str_split(rownames(gridTab), "_", simplify= T),
2,
as.numeric)
)
)
plot(dat$X2, dat$X1,
main = main,
pch = ".",
frame= F,
xlab = "Long",
ylab= "Lat",
cex = log(dat$div))
maps::map(regions = c("USA", "Canada"),
fill = F,
col = "black",
add = T)
}
par(mfrow = c(1,2), mar = c(0,0,4,0))
plotAmap(gridTab, main = "Richness distribution")
plotAmap(gridTab2, main = "Effort distribution")This matrix holds information on the presence/absence or relative abundances of a set of species throughout a series of localities (sites). In our tutorial, the sites correspond to each of the 1x1 degree grids. For this tutorial we will focus only on the aspect of biodiversity change that comes from the variance of species presences and absences in the ROI. Therefore, entries in the species-site matrix are coded as 1 if a species \(j\) is present at a grid \(n\), and 0 otherwise. Row vectors of this matrix represent the spatial variation of each species across the ROI. Column vectors represent the composition of a site in terms of species.
\[\begin{bmatrix} & site1 & site2 & ... & site_n \\Sp1 & 1 & 0 & ...& 1 \\Sp2 & 0 & 1 & ...& 1 \\Sp3 & 1 & 1 & ...& 0 \\Sp4 & 1 & 0 & ...& 1 \\\vdots & \vdots & \vdots & \vdots & \vdots \\Sp_j & 1 & 1 & 1 & 0 \end{bmatrix}\]
Biogeographic regions are defined by clustering the variance of the species dissimilarities among sites. This differences in species composition is commonly known as species turnover and it is a separate component (ß) of biodiversity. We will compute pairwise dissimilarities in grid species composition. Pairwise dissimilarities in species composition among communities are in function of the number of shared species and the sum of the species diversity of a pairwise set of communities. There are several metrics of community dissimilarity. I encourage you to explore that further in your own. In this tutorial we will use the Simpson dissimilarity index as a measure of species turnover.
\[ S_i = \frac{a}{a + min(b,c)} \]
being \(a\) the number of shared species and \(min(b,c)\) the minimum number of species between two communities.
There are several advantages of Simpson dissimilarity index relative to other indexes. For instance Simpsons uses only the poorest community in the denominator to account for large differences in richness between communities. This means that the index is not biased by strong richness differences between sites, therefore removing the contribution of nestedness to the pairwise dissimilarities
Let’s define a function to compute this index and apply it to our data to obtain a dissimilarity matrix .
# Function to compute the simpson dissimilarity index among sites in our matrix and return a dissimilarity matrix
# param: SpTab = species-site matrix
getDisMat <- function(SpTab){
# How to measure similarity?
# presence-absence
SpTabPA <-SpTab
SpTabPA[SpTabPA>1] <- 1
# Simpson dissimilarity index
# Calcualate "simpson dissimilarity index" (Bsim part of sorensen dissimilarity)
part <- betapart::beta.pair(SpTabPA,
"sorensen")
# summary(part$beta.sim) ## access the portion of the object that correspond to the Simpson index of dissmilarity
distMat <- part$beta.sim
return(distMat)
}
distMat <- getDisMat(SpTab)Lets observe the variance of the dissimilarity matrix, how many clusters can you observe by eyeballing a heatmap?
# visualize single dimensions in space.
# compute the variance of grids
varVec <- sapply(1:nrow(as.matrix(distMat)), function(x) var(as.matrix(distMat)[,x]))
# function to colorize
f <- function(x,n=10, pal, rev = F){
if(rev == F){
rev(RColorBrewer::brewer.pal(n, pal))[cut(x,n)]
}else{
(RColorBrewer::brewer.pal(n, pal))[cut(x,n)]
}
}
plotFocalPoint <- function(distMat, varVec){
focalPoint <- data.frame("FP"= as.matrix(distMat)[,which(varVec == sample(varVec,1))])
distPlot <- data.frame(stringr::str_split(
rownames(focalPoint),
"_",
simplify = T),
"dist" = focalPoint[,1])
names(distPlot) <- c("lat", "lon", "val")
distPlot <- data.frame(
apply(distPlot,
2,
as.numeric))
plot(distPlot$lat~distPlot$lon,
xlim = c(-170,-50),
ylim = c(20,90),
xaxt = "n",
yaxt = "n",
xlab = "",
ylab = "",
pch = 15, cex = 0.5)
maps::map(regions = c("USA","Canada"),
fill = T,
col = "white",
add = T)
points(distPlot$lat~distPlot$lon,
pch = 16,
cex = 0.7,
col = scales::alpha(
f(log(distPlot$val+1),
9,
"YlOrRd", T),
1-log(distPlot$val+1)))
maps::map(regions = c("USA", "Canada"),
fill = F,
col = "black",
add = T)
}
par(mfrow = c(4,2), mar = c(0,0,0,0), bg = "gray")
for(i in 1:8){
plotFocalPoint(distMat,varVec)
}Unless we are interested in the decay of biodiversity turnover from a focal point having many dimensions is not so informative. We have to try to cluster all these variation into fewer, more analytically tractable dimensions.
We will start by removing some potential source of noise in our dataset. We must remember that we deal with incomplete records so there is the potential for non-ecologically relevant sources of variation. For instance, poorly sampled localities may inflate the total sum of pairwise dissimilarities, reducing our ability to disentangle patterns. There is not a magic number of grids or species to preserve but it rather depends to the study objectives, dataset characteristics, and study scale and resolution.
# remove potential noise
SpTab2 <- SpTab[rowSums(SpTab) > 10,
-which(colSums(SpTab)
%in% c(1:2))]
distMat2 <- getDisMat(SpTab2)There are two main ways how we can approach this problem of linearizing multidimensional variation in community dissimilarities to define biogeographic regions. We can classify or communities into discrete categories or continuously ordinate them in fewer multivariate dimensions.
Ordination are a set of statistical techniques used principally to reduce the dimensionality of a matrix. In our case we can think about sites and species as dimensions. Using this vector we can construct the most parsimonious organization in 2d (or more) space.
We will use a technique called Non-Metric Multidimensional (NMDS) scaling to find the most parsimonious arrangement of our distance-matrix items that preserve the community wide pattern of pairwise distances between species and sites.
We will not to go into details of the procedure, but if you need a reminder of ordination methods or the mathematical basis of NMDS you can find information in the following links:
Ordination:
http://ordination.okstate.edu/overview.htm
https://wiki.qcbs.ca/r_workshop9
NMDS:
http://www.flutterbys.com.au/stats/tut/tut15.1.html
http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf
## Run 0 stress 0.153179
## Run 1 stress 0.1531994
## ... Procrustes: rmse 0.003408079 max resid 0.0705506
## Run 2 stress 0.1532109
## ... Procrustes: rmse 0.004187233 max resid 0.074795
## Run 3 stress 0.1531694
## ... New best solution
## ... Procrustes: rmse 0.003777082 max resid 0.06345932
## Run 4 stress 0.1531988
## ... Procrustes: rmse 0.006901958 max resid 0.1398042
## Run 5 stress 0.1531668
## ... New best solution
## ... Procrustes: rmse 0.002584285 max resid 0.02119809
## Run 6 stress 0.1532305
## ... Procrustes: rmse 0.004935018 max resid 0.1035602
## Run 7 stress 0.1539757
## Run 8 stress 0.1531778
## ... Procrustes: rmse 0.002628541 max resid 0.02341708
## Run 9 stress 0.1531662
## ... New best solution
## ... Procrustes: rmse 0.002879594 max resid 0.03384198
## Run 10 stress 0.1532283
## ... Procrustes: rmse 0.003190064 max resid 0.04515333
## Run 11 stress 0.1531971
## ... Procrustes: rmse 0.005398809 max resid 0.1039415
## Run 12 stress 0.1531548
## ... New best solution
## ... Procrustes: rmse 0.001289026 max resid 0.02266371
## Run 13 stress 0.1531744
## ... Procrustes: rmse 0.00244465 max resid 0.01041041
## Run 14 stress 0.1531904
## ... Procrustes: rmse 0.005846499 max resid 0.115119
## Run 15 stress 0.1531797
## ... Procrustes: rmse 0.002534016 max resid 0.05246228
## Run 16 stress 0.153185
## ... Procrustes: rmse 0.002518792 max resid 0.01755585
## Run 17 stress 0.1532033
## ... Procrustes: rmse 0.004908027 max resid 0.09179624
## Run 18 stress 0.153258
## ... Procrustes: rmse 0.002494914 max resid 0.04999442
## Run 19 stress 0.1532058
## ... Procrustes: rmse 0.005635713 max resid 0.1173142
## Run 20 stress 0.1532393
## ... Procrustes: rmse 0.00259549 max resid 0.05284291
## *** No convergence -- monoMDS stopping criteria:
## 8: no. of iterations >= maxit
## 12: stress ratio > sratmax
## species scores not available
This approach consider the species-site matrix as a bipartite graph. Vertices (nodes) of this graph are either species or grids. Edges (links) are connecting vertices based on the occurrence of a given species in a given grid. Without going to details, the spectrum of this graph, correspond to the set of eigenvalues and eigenvectors of the adjacency or laplacian matrices of the graph. Eigenvalues and eigenvectors represent the unique properties of a given matrix configuration. Importantly, the largest eigenvalues and associated eigenvectors contain information on the nested and modular structure of a graph.
SpecScore <- scores(rda(data.frame(eiVal$vectors)))
hist(log(abs(eiVal$vectors[1:nrow(SpTab2)])),
main = "",
xlab = "2 Eigenvector (log)")nm <- V(SpGraph)$name[1:nrow(SpTab2)]
eiVal <- data.frame(data.frame(apply(stringr::str_split(nm, "_", simplify = T),2, as.numeric)),
"eig" = eiVal$vectors[1:nrow(SpTab2),],
"Score" = SpecScore$sites[1:nrow(SpTab2),])k-means is a fast clustering algorithm. The k-means algorithm searches to partition the variance of a matrix into groups of k clusters centered around k means.
This is common methodology for *un-supervised" classification tasks. That is, we let the data to tell us the best partition into separate classes.
steps for the K-means algorithm (from wikipedia)
K-means algorithm assumes euclidean distances to compute k clusters. Simpson dissimilarity index index does not project linearly into euclidean space. The Hellinger standardization is a workaround which let’s us use euclidean metrics with community dissimilarity metrics.
\[y_{i,j} = \sqrt{\frac{y_{i,j}}{y_i}}\]
The main limitation of the k-means algorithm is that the cluster association vector is always dependent on the values of k. How do we know the best k then?. This therefore becomes an optimization problem for the value of k.
A solution to find the best k is to iterate the k-means algorithm for a series of k within a range of reasonable estimate for the number of clusters given our community matrix. (e.g k << S or k < S not k = S or k > S)
We optimize the ratio between the variance within k given clusters in relation with the total amount of variation in the matrix. We select the k parameter which the sum of squares of the distances between k clusters (i.e. the distances of the k-centroids to its mean centroid) is closest to the sum of squares of the whole matrix.
# standardize distances and apply k-means for a range of k.
cascadeKM <- vegan::cascadeKM(vegan::decostand(distMat2,
"hellinger"), # standardization of distances
inf.gr = 2, # lowest k tested
sup.gr = 10) # highest k testedLets observe the change on the sum of squares and the similar Calinski criterion after applying the k-means for a range of \(k={2,3,4,5,6,7,8,9,10}\)
KM2 <- cascadeKM$partition[,1]
KM3 <- cascadeKM$partition[,2]
KM4 <- cascadeKM$partition[,3]
KM <- data.frame(KM2,KM3,KM4,
apply(stringr::str_split(names(KM2), pattern = "_", simplify = T), 2, as.numeric) )
head(KM)## KM2 KM3 KM4 X1 X2
## 31_-90 1 1 3 31 -90
## 32_-106 1 1 3 32 -106
## 33_-106 1 1 3 33 -106
## 33_-107 1 1 3 33 -107
## 33_-108 1 1 3 33 -108
## 34_-78 1 1 3 34 -78
## Warning in RColorBrewer::brewer.pal(n, pal): minimal value for n is 3, returning requested palette with 3 different levels
WORLDCLIM <-raster::getData("worldclim",var="bio",res=10 )
BIO1 <- (raster::intersect(WORLDCLIM$bio1,
SpatialPoints(data.frame(mdsScor$X2,mdsScor$X1),
proj4string = WORLDCLIM$bio1@crs)))
BIO12 <- (raster::intersect(WORLDCLIM$bio12,
SpatialPoints(data.frame(mdsScor$X2,mdsScor$X1),
proj4string = WORLDCLIM$bio1@crs)))climSpace = data.frame(coordinates(BIO1),
"BIO1" = getValues(BIO1))
climSpace2 = data.frame(coordinates(BIO12),
"BIO12" = getValues(BIO12))
climSpace$unID <- paste(round(climSpace$x),
round(climSpace$y),
sep = "_")
climSpace2$unID <- paste(round(climSpace2$x),
round(climSpace2$y),
sep = "_")
climSpace$Bio12 <- climSpace2$BIO12[match(climSpace$unID, climSpace2$unID)]
climSpace <- droplevels(climSpace[complete.cases(climSpace),])colEigen = (eiVal$Score.PC2[match(climSpace$unID,
paste(eiVal$X2,
eiVal$X1,
sep = "_"))])
colPC1 = (mdsScor$PC1[match(climSpace$unID,
paste(mdsScor$X2,
mdsScor$X1,
sep = "_"))])
colKM = (KM$KM4[match(climSpace$unID,
paste(KM$X2,
KM$X1,
sep = "_"))])plot(scale(climSpace$BIO1),scale(climSpace$Bio12),
col = f((colKM), 4,"Spectral"),cex = 0.7,
xlab = "Mean annual temperature",
ylab = "Mean annual precipitation",
frame = F,
pch = 18)model <-lm(log(colPC1+10)~climSpace$BIO1+climSpace$Bio12,na.action = "na.exclude")
model1 <-lm(log(colPC1+10)~climSpace$BIO1*climSpace$Bio12,na.action = "na.exclude")
model2 <-lm(colPC1~climSpace$BIO1+climSpace$Bio12, na.action = "na.exclude")
model3 <-lm(colPC1~climSpace$BIO1*climSpace$Bio12, na.action = "na.exclude")
model4 <-lm(colKM~climSpace$BIO1*climSpace$Bio12, na.action = "na.exclude")
AIC(model, model1,model2, model3, model4)## df AIC
## model 4 -36888.89
## model1 5 -36904.64
## model2 4 38546.29
## model3 5 38520.50
## model4 5 34497.15
## Analysis of Variance Table
##
## Response: log(colPC1 + 10)
## Df Sum Sq Mean Sq F value Pr(>F)
## climSpace$BIO1 1 2.010 2.0104 417.28 < 2.2e-16 ***
## climSpace$Bio12 1 4.398 4.3977 912.80 < 2.2e-16 ***
## Residuals 14769 71.154 0.0048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(climSpace$x, climSpace$y,
cex = 0.7, frame = F,main = "residuals",
xlab = "Long",
ylab = "Lat",
col = f(abs(climSpace$res), 5, "Reds", T), pch = ".")
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)BIO1c <- intersect(BIO1,BIO12)
pred1 <- predict( model,data.frame(climSpace$BIO1,climSpace$Bio12))
pred2 <- predict( model2,data.frame(climSpace$BIO1,climSpace$Bio12))
pred3 <- predict( model3,data.frame(climSpace$BIO1,climSpace$Bio12))
pred <- data.frame(climSpace, pred1,pred2,pred3)
par(mfrow = c(1,3), mar = c(0,0,0,0))
plot(pred$x,pred$y, col = f(pred$pred1, 11, "Spectral"),
pch = ".", cex = 3)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)
plot(pred$x,pred$y, col = f(pred$pred2, 11, "Spectral"),
pch = ".", cex = 3)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)
plot(pred$x,pred$y, col = f(pred$pred3, 11, "Spectral"),
pch = ".", cex = 3)
maps::map(regions = c("USA", "Canada"),fill = F,col = "black",
add = T)Carter et al., 2016. Endemic regions of mosses in North America